Dirichlet Mixtures in Text Modeling

Authors

  • Mikio Yamamoto
  • Kugatsu Sadamitsu
Abstract

Word rates in text vary according to global factors such as genre, topic, author, and expected readership (Church and Gale 1995). Models that summarize such global factors at the text or document level are called 'text models.' A finite mixture of Dirichlet distributions (Dirichlet Mixture, or DM for short) is investigated as a new text model. When the parameters of a multinomial are drawn from a DM, the compound distribution over discrete outcomes is a finite mixture of Dirichlet-multinomials. A Dirichlet-multinomial can be regarded as a multivariate version of the Poisson mixture, a reliable univariate model of global factors (Church and Gale 1995). In the present paper, the DM and its compounds are introduced, along with parameter estimation methods derived from Minka's fixed-point methods (Minka 2003) and the EM algorithm. The method can estimate a considerable number of parameters for a large DM, i.e., a few hundred thousand parameters. After discussing the relationships between the DM and probabilistic latent semantic analysis (PLSA) (Hofmann 1999), the mixture of unigrams (Nigam et al. 2000), and latent Dirichlet allocation (LDA) (Blei et al. 2001, 2003), applications to statistical language modeling are discussed and their performance is compared in terms of perplexity. The DM model achieves the lowest perplexity despite its unitopic nature.
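As an illustrative sketch of the estimation machinery the abstract mentions (not the authors' implementation), Minka's fixed-point iteration for a single Dirichlet-multinomial (Polya) component multiplies each parameter by a ratio of digamma differences computed from document count vectors; the function name `fit_polya`, the initialization, and the hand-rolled `digamma` approximation below are all assumptions of this sketch:

```python
import math

def digamma(x):
    # Digamma via recurrence shift plus asymptotic series; adequate for x > 0.
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1 / 12.0 - f * (1 / 120.0 - f / 252.0))

def fit_polya(counts, n_iter=500, tol=1e-9):
    """Fit Dirichlet-multinomial (Polya) parameters alpha to a list of
    per-document count vectors with Minka's fixed-point update:
    alpha_k <- alpha_k * sum_d[psi(n_dk+alpha_k)-psi(alpha_k)]
                       / sum_d[psi(n_d+a0)-psi(a0)]."""
    n_docs = len(counts)
    n_words = len(counts[0])
    totals = [sum(row) for row in counts]
    # Initialize at the per-word mean count; any positive start works.
    alpha = [sum(row[k] for row in counts) / n_docs + 0.01 for k in range(n_words)]
    for _ in range(n_iter):
        a0 = sum(alpha)
        den = sum(digamma(t + a0) for t in totals) - n_docs * digamma(a0)
        new_alpha = []
        for k in range(n_words):
            num = (sum(digamma(row[k] + alpha[k]) for row in counts)
                   - n_docs * digamma(alpha[k]))
            new_alpha.append(alpha[k] * num / den)
        if max(abs(a - b) for a, b in zip(alpha, new_alpha)) < tol:
            alpha = new_alpha
            break
        alpha = new_alpha
    return alpha
```

For a full DM, one such update would run per mixture component inside the M-step of EM, with document responsibilities computed in the E-step.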


Similar Articles

Infinite Dirichlet Mixtures in Text Modeling

This paper proposes a Dirichlet process mixture modeling approach to Dirichlet Mixtures (DM). By placing a prior distribution over an infinite number of mixture components, this approach yields an appropriate number of components, as well as their parameters, at the same time. Experimental results on amino acid distributions and text corpora confirmed this effect and showed comparable performance on...


A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...


Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

Purpose: This study investigates automatic keyword extraction from the tables of contents of Persian e-books in the field of science using LDA topic modeling, evaluating the keywords' similarity with a golden standard and users' viewpoints on the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...


Nested Hierarchical Dirichlet Processes for Multi-Level Non-Parametric Admixture Modeling

The Dirichlet Process (DP) is a Bayesian non-parametric prior for infinite mixture modeling, where the number of mixture components grows with the number of data items. The Hierarchical Dirichlet Process (HDP), often used for non-parametric topic modeling, is an extension of the DP for grouped data, where each group is a mixture over shared mixture densities. The Nested Dirichlet Process (nDP), on the o...


Dirichlet Mixtures for Query Estimation in Information Retrieval

Treated as small samples of text, user queries require smoothing to better estimate the probabilities of their true model. Traditional techniques to perform this smoothing include automatic query expansion and local feedback. This paper applies the bioinformatics smoothing technique, Dirichlet mixtures, to the task of query estimation. We discuss Dirichlet mixtures’ relation to relevance models...
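The smoothing idea described above can be sketched as follows, under assumptions of this illustration (the paper's actual estimator may differ): given query word counts and a pre-trained Dirichlet mixture, each component's posterior responsibility weights its posterior-mean word probabilities. The names `log_dirmult` and `dm_smooth` are hypothetical:

```python
import math

def log_dirmult(counts, alpha):
    # Log Dirichlet-multinomial likelihood of a count vector under alpha.
    a0, n = sum(alpha), sum(counts)
    lp = math.lgamma(a0) - math.lgamma(n + a0)
    for c, a in zip(counts, alpha):
        lp += math.lgamma(c + a) - math.lgamma(a)
    return lp

def dm_smooth(counts, weights, alphas):
    """Smooth query word probabilities under a Dirichlet mixture prior:
    mixture responsibilities times each component's posterior-mean
    estimate (counts[k] + alpha[k]) / (n + sum(alpha))."""
    logs = [math.log(w) + log_dirmult(counts, a)
            for w, a in zip(weights, alphas)]
    m = max(logs)
    resp = [math.exp(l - m) for l in logs]        # log-sum-exp trick
    z = sum(resp)
    resp = [r / z for r in resp]
    n = sum(counts)
    probs = [0.0] * len(counts)
    for r, a in zip(resp, alphas):
        a0 = sum(a)
        for k in range(len(counts)):
            probs[k] += r * (counts[k] + a[k]) / (n + a0)
    return probs
```

With a single component this reduces to standard Dirichlet smoothing; the mixture lets different priors dominate depending on which one best explains the observed query.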



Journal:

Volume   Issue 

Pages  -

Publication date: 2005